We have been pursuing a topic modeling approach to the clustering of single cells (Jaitin et al 2014) or human tissues (GTEX project) based on RNA-Seq read counts data. Usually we are provided with an \(N \times M\) table, where the \(N\) rows correspond to the tissue samples or single cells and the \(M\) columns correspond to the genes. Usually in such settings \(M \gg N\), which further complicates the computations. An investigator may be interested in clustering the tissues or single cells into structurally or functionally homogeneous groups. While hierarchical clustering and multinomial model based clustering are the usual techniques for clustering RNA-Seq read counts data, we have been trying to implement a topic model or admixture type algorithm for RNA-Seq reads data, an approach that is very popular in population genetics for clustering genotype data and in document clustering in NLP (Natural Language Processing). A very important challenge in this setting, however, is dealing with batch effects, which are substantial in RNA-Seq data; if we do not account for them, our clusters may be driven by these technical effects. Much of our focus in this report is therefore on handling the batch effect problem, so that the clusters or topics we obtain are determined solely by biological variation and not by technical effects.
To begin, we decided to build a normal topic model, first without and then with a batch effect. Note that, in general, an RNA-Seq reads table gives the counts of the number of reads corresponding to a particular gene in a particular tissue sample. A normal fit to count data is not recommended, but investigators often prefer working with RPKM values rather than the raw reads, and the normal assumption is commonplace in such analyses. This was, to some extent, our biological motivation for starting with the normal model. The other motivation was that the normal model is usually the simplest statistical model to work with.
Let \(c_{ng}\) be the transformed read counts (RPKM values, say) for sample or individual \(n\) and gene \(g\), where we assume that \(c_{ng}\) has a normal distribution.
Then we can write the model as
\[ c_{ng} = \sum_{k=1}^{K} \omega_{nk}\alpha_{kg} + u_{ng}, \hspace{1.5 cm} \omega_{nk} \geq 0, \hspace{0.5 cm} \sum_{k=1}^{K} \omega_{nk}=1 \]
where \(u_{ng}\) and \(u_{n^{'}g}\) are assumed to be correlated when \(n \neq n^{'}\) and the two samples belong to the same batch. We can assume this correlation to be constant as well. \[ cor(u_{ng},u_{n^{'}g}) =\rho_{g} \hspace{.5 cm} n \neq n^{'} \hspace{0.5 cm} b(n)=b(n^{'}) \]
To induce this correlation structure, we assume a random effect term \(\beta_{b(n):g}\) in the model. So another formulation of the above model can be written as
\[ c_{ng} = \sum_{k=1}^{K} \omega_{nk}\alpha_{kg} + \beta_{b(n):g} + e_{ng}, \hspace{1.5 cm} \omega_{nk} \geq 0, \hspace{0.5 cm} \sum_{k=1}^{K} \omega_{nk}=1 \]
where \(e_{ng}\)’s represent the measurement error and hence may be treated as independent. The random effect \(\beta_{b(n):g}\) is assumed to follow the distribution
\[ \beta_{b(n):g} \sim N(0, \eta^{2}_{g}) \]
So for a fixed \(g\), the random effects for the individual batches \(b(n)\) are assumed to be drawn from the same normal distribution. Note that in the model just described, the only terms shared across genes are the topic weights \(\omega_{nk}\). Since we separately estimate the \(\omega_{nk}\) using an active set strategy and then plug the estimates into these equations, the \(\omega\)'s can be held constant in this model, which implies that it suffices to look at the model gene by gene and estimate the parameters separately.
First we assume a further simplification of the normal model: that there is NO BATCH EFFECT. We try to estimate that model first.
The model can be written down as
\[ c_{ng} = \sum_{k=1}^{K} \omega_{nk}\alpha_{kg} + e_{ng}, \hspace{1.5 cm} \omega_{nk} \geq 0, \hspace{0.5 cm} \sum_{k=1}^{K} \omega_{nk}=1 \]
where \(e_{ng} \sim N(0,\sigma^2)\) are the measurement error terms. Viewed differently, this reduces to the approximate matrix factorization of the RPKM values matrix \(C_{N \times G}\) into two matrices \(W_{N \times K}\) and \(H_{K \times G}\). We wish to determine \(W\) and \(H\) such that
\[ C_{N \times G} \approx W_{N \times K} H_{K \times G} \hspace{1.5 cm} W1=1 \hspace{0.5 cm} W_{nk} \geq 0\]
We could not find any standard algorithm in the literature that performs this factorization. Note that by definition the matrices \(C\) and \(W\) are non-negative, but we do not require \(H\) to be non-negative. However, we can think of adding a large constant to each element of the matrix \(C\) (and hence of \(H\)) to force \(H\) to be non-negative as well. This means that essentially we want to compute an NMF factorization with additional constraints on the matrix \(W\). In order to solve this problem, we minimize the Frobenius norm of the difference between \(C\) and \(WH\) with respect to the matrices \(W\) and \(H\).
\[ (W^{\star},H^{\star}) = \underset{W1=1,W \geq 0,H}{argmin} || C - WH ||_{F} \]
which is equivalent to
\[ (W^{\star},H^{\star}) = \underset{\begin{pmatrix} \omega_{nk} \geq 0 \;\; \forall n \;\; \forall k \\ \sum_{k=1}^{K} \omega_{nk}=1 \end{pmatrix}}{argmin} \sum_{n=1}^{N}\sum_{g=1}^{G} (c_{ng} -\sum_{k=1}^{K} \omega_{nk}h_{kg})^2 \]
To do so, we start with an initial matrix \(W^{(0)}\) with entries \(\omega^{(0)}_{nk}\). At the \(m\)th step, given \(W^{(m)}\), we estimate the matrix \(H\) as follows
\[ H^{(m)} = ({W^{(m)}}^{T}W^{(m)})^{-1}{W^{(m)}}^{T}C \]
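This H-update is simply ordinary least squares applied to all genes at once. A minimal sketch in R (the function name update_H is ours; it assumes \(W\) has full column rank):

```r
# Least-squares update for H given the current W (equation above).
# Assumes t(W) %*% W is invertible, i.e. W has full column rank.
update_H <- function(W, C) {
  solve(t(W) %*% W, t(W) %*% C)
}

# small illustration with a known, noiseless factorization
set.seed(1)
W <- matrix(runif(20), nrow = 10, ncol = 2)
W <- W / rowSums(W)                 # rows sum to 1, as in the model
H <- matrix(rnorm(2 * 5), nrow = 2)
C <- W %*% H
H_hat <- update_H(W, C)
max(abs(H_hat - H))                 # essentially zero
```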
Now given \(H^{(m)}\), we then obtain the proportions \(\omega^{(m+1)}_{nk}\) and hence the matrix \(W^{(m+1)}\). For each \(n\), we know that \(\omega_{nk} \geq 0\) and \(\sum_{k=1}^{K} \omega_{nk}=1\) and we transform this vector \(\omega_{n,\star} = \left \{ \omega_{nk}: k=1,2,\cdots,K \right \}\) into the vector \(v_{n,\star}\), a \((K-1)\) vector such that
\[v_{nl} = log \left (\frac{\omega_{nl}+\epsilon}{\omega_{nK}+\epsilon} \right ) \hspace{1 cm} \epsilon=1e-10, \hspace{1 cm} l=1,2,\cdots, (K-1) \]
We can therefore write \(\omega_{n,\star}\) s in terms of \(v_{n,\star}\) as
\[ \omega_{nK} = \frac{1+(K-1)\epsilon -\epsilon \sum_{l=1}^{K-1} exp(v_{nl})} {\left(1+\sum_{l=1}^{K-1}exp(v_{nl}) \right)}\approx \frac{1}{1+\sum_{l=1}^{K-1}exp(v_{nl})} =f_{nK}(v_{n,\star}) \]
\[ \omega_{nl} = exp(v_{nl}) (\omega_{nK}+\epsilon) -\epsilon = f_{nl}(v_{n,\star}), \hspace{1 cm} l=1,2,\cdots,(K-1) \]
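The reparametrization and its (approximate) inverse are straightforward to code; a sketch with illustrative function names and a round-trip check:

```r
eps <- 1e-10

# omega (a point on the K-simplex) -> v (unconstrained, length K-1)
omega_to_v <- function(omega) {
  K <- length(omega)
  log((omega[-K] + eps) / (omega[K] + eps))
}

# v -> omega, using the approximate inverse f_{nk} given above
v_to_omega <- function(v) {
  omega_K <- 1 / (1 + sum(exp(v)))
  c(exp(v) * (omega_K + eps) - eps, omega_K)
}

omega <- c(0.1, 0.3, 0.4, 0.2)
v <- omega_to_v(omega)
max(abs(v_to_omega(v) - omega))   # round-trip error of order eps
```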
So the above minimization problem in terms of \(\omega\) reduces to the following minimization problem in \(v\).
\[ v^{(m+1)}_{n,\star} = \underset{v_{n,\star}\; \forall n}{argmin} \sum_{n=1}^{N}\sum_{g=1}^{G} (c_{ng} -\sum_{k=1}^{K} f_{nk}(v_{n,\star}) \alpha^{(m)}_{kg})^2 \]
where \(h^{(m)}_{kg}= \alpha^{(m)}_{kg}\). This is a straightforward unconstrained optimization, which can be done easily by a BFGS type algorithm in R (we use the function optim).
Then we define
\[ \omega^{(m+1)}_{nk} = f_{nk} (v^{(m+1)}_{n,\star}) \]
and thus we define each entry of the matrix \(W^{(m+1)}\). We continue iterating until either we reach MaxIter (which I set as 500 in all my codes) or the difference between two consecutive values of the objective function is less than \(1e-07\).
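The full alternating scheme can then be sketched as follows (a simplified illustration with names of our choosing; the random start and the row renormalization are our additions for numerical stability):

```r
# Alternating least squares for C ~ W H with simplex-constrained rows of W.
fit_normal_topics <- function(C, K, max_iter = 500, tol = 1e-7) {
  N <- nrow(C); eps <- 1e-10
  v_to_omega <- function(v) {
    wK <- 1 / (1 + sum(exp(v)))
    w <- c(exp(v) * (wK + eps) - eps, wK)
    pmax(w, 0) / sum(pmax(w, 0))          # renormalize for numerical safety
  }
  W <- matrix(rgamma(N * K, 1), N, K)     # random non-negative start
  W <- W / rowSums(W)
  obj_old <- Inf
  for (m in 1:max_iter) {
    H <- solve(t(W) %*% W, t(W) %*% C)    # H-update: ordinary least squares
    for (n in 1:N) {                      # W-update, one row at a time
      fn <- function(v) sum((C[n, ] - v_to_omega(v) %*% H)^2)
      v0 <- log((W[n, -K] + eps) / (W[n, K] + eps))
      W[n, ] <- v_to_omega(optim(v0, fn, method = "BFGS")$par)
    }
    obj <- sum((C - W %*% H)^2)
    if (abs(obj_old - obj) < tol) break
    obj_old <- obj
  }
  list(W = W, H = H)
}
```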
We simulate a normal topic model without batch effect, and match the true topic weights matrix \(W\), represented as a Structure plot, with the estimated topic weights matrix for three cases: one with no error (exact model), one with error variance \(0.01\), and one with error variance \(0.09\). The simulation set up is given in the following R chunk
K=4;
G=100;
N=100;
alpha_true=1000+matrix(rnorm((K)*G,0.5),nrow=(K)); ### the matrix of fixed effects
Label.Batch=c(rep(1,N/5),rep(2,N/5),rep(3,N/5),rep(4,N/5),rep(5,N/5)); ### batch labels (the batch variables below are generated but not used in this no-batch simulation)
B=max(Label.Batch);
sigmab_true=2;
beta_true=matrix(0,B,G); ### the matrix of the random effect
for(g in 1:G)
{
beta_true[,g]=rnorm(B,mean=0,sd=sigmab_true);
}
library(gtools)
T=2;
omega_true=matrix(rbind(rdirichlet(T*10,c(3,4,2,6)),rdirichlet(T*10,c(1,4,6,3)),
rdirichlet(T*10,c(4,1,2,2)),rdirichlet(T*10,c(2,6,3,2)),
rdirichlet(T*10,c(3,3,5,4))), nrow=N);
H_true=alpha_true;
sd=0;
# sd=0.01;
# sd=0.09;
### generating the table
read_counts=matrix(0,N,G);
for(n in 1:N)
{
read_counts[n,]=omega_true[n,]%*%H_true +sd*rnorm(G,0,1);
}
The estimates seem pretty close to the actual topic proportion weights. So, we now drop the assumption of no batch effect and consider the more general normal topic model with the batch effect (introduced here as a random effect).
So, the model we are considering now is
\[ c_{ng} = \sum_{k=1}^{K} \omega_{nk}\alpha_{kg} + \beta_{b(n):g} + e_{ng}, \hspace{1.5 cm} \omega_{nk} \geq 0, \hspace{0.5 cm} \sum_{k=1}^{K} \omega_{nk}=1 \]
where
\[ \beta_{b(n):g} \sim N(0, \eta^{2}_{g}) \]
and \(e_{ng} \sim N(0,\sigma^2)\).
In this case, we follow a technique very similar to the one in the previous section. We start with some initial values \(\omega^{(0)}_{nk}\), and given the topic proportion weights \(\omega^{(m)}_{nk}\) at the \(m\)th stage, we fit an LMM for a fixed \(g\) as follows
\[ c_{n,g} = \sum_{k=1}^{K} \alpha_{kg} \omega^{(m)}_{nk} + \beta_{b(n):g} + e_{ng} \hspace{1 cm} \forall n, \hspace{0.2 cm} g \; fixed \]
Then \(\alpha_{kg}\) is the coefficient of the regressor \(\omega^{(m)}_{\star,k}\), so for each \(g\) this is a linear mixed model (LMM) with \(K\) regressors given by \(\omega^{(m)}_{nk}\) and one random effect. We solve this using the lmer function of the lme4 package in R, and thus obtain estimates of the regression coefficients \(\alpha_{kg}\) and the random effects \(\beta_{b(n):g}\). We call these estimated values \(\alpha^{(m)}_{kg}\) and \(\beta^{(m)}_{bg}\) for \(b=1,2,\cdots,B\), \(k=1,2,\cdots,K\) and \(g=1,2,\cdots,G\). Given these estimates, we can determine \(\omega^{(m+1)}_{nk}\) as above
\[ v^{(m+1)}_{n,\star} = \underset{v_{n,\star}\; \forall n}{argmin} \sum_{n=1}^{N}\sum_{g=1}^{G} (c_{ng} - \beta^{(m)}_{b(n):g} -\sum_{k=1}^{K} f_{nk}(v_{n,\star}) \alpha^{(m)}_{kg})^2 \]
\[ \omega^{(m+1)}_{nk} = f_{nk} (v^{(m+1)}_{n,\star}) \]
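For one gene, the LMM step above might be coded with lme4 as follows (a sketch; the function name and data layout are ours, and the lme4 package is assumed to be installed):

```r
library(lme4)

# Per-gene LMM fit: c_g ~ sum_k alpha_kg * omega_nk + random batch intercept.
# c_g: length-N response for one gene; omega: N x K weight matrix;
# batch: length-N batch labels.
fit_gene_lmm <- function(c_g, omega, batch) {
  df <- data.frame(y = c_g, batch = factor(batch))
  df$W <- omega                                     # matrix column: K regressors
  fit <- lmer(y ~ 0 + W + (1 | batch), data = df)   # 0 + : no global intercept
  list(alpha = unname(fixef(fit)),                  # estimates of alpha_{1g},...,alpha_{Kg}
       beta  = ranef(fit)$batch[, 1])               # predicted beta_{b:g}, one per batch
}
```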
We continue iterating until the stopping criterion from the previous section is reached. The simulation set up and the output are given as follows
################## Set up for Counts Table ########################
K=4;
G=100;
N=100;
alpha_true=matrix(rnorm((K)*G,0.5),nrow=(K)); ### the matrix of fixed effects
Label.Batch=c(rep(1,N/5),rep(2,N/5),rep(3,N/5),rep(4,N/5),rep(5,N/5));
B=max(Label.Batch);
sigmab_true=2;
beta_true=matrix(0,B,G); ### the matrix of the random effect
for(g in 1:G)
{
beta_true[,g]=rnorm(B,mean=0,sd=sigmab_true);
}
library(gtools)
T=2;
omega_true=matrix(rbind(rdirichlet(T*10,c(3,4,2,6)),rdirichlet(T*10,c(1,4,6,3)),
rdirichlet(T*10,c(4,1,2,2)),rdirichlet(T*10,c(2,6,3,2)),
rdirichlet(T*10,c(3,3,5,4))), nrow=N);
H_true=alpha_true+1000;
alpha_true=H_true
### generating the table
sd=0;
# sd=0.01;
# sd=0.09;
read_counts=matrix(0,N,G);
for(n in 1:N)
{
read_counts[n,]=omega_true[n,]%*%H_true +beta_true[Label.Batch[n],]+sd*rnorm(G,0,1);
}
In this section, we generate a table of Poisson counts. Here we assume
\[ c_{ng} \sim Poi(\lambda_{n}(g)) \]
\[ log(\lambda_{n}(g)) = \sum_{k=1}^{K} \omega_{nk}\alpha_{kg} + \beta_{b(n):g} + e_{ng} \]
Here \(\alpha_{kg}\) represents the fixed effect or mixing coefficient for the topics or clusters, the term \(\beta_{b(n):g}\) is the random intercept for batch \(b(n)\) and gene \(g\), and the term \(e_{ng}\) accounts for overdispersion. We assume that
\[ \beta_{b:g} \sim N(0, \sigma^{2}_{g}) \]
We consider two cases, the first with no random effect. The only change to this model from the one presented above is that
\[ log(\lambda_{n}(g)) = \sum_{k=1}^{K} \omega_{nk}\alpha_{kg} \]
We examine how well the normal topic model determines the actual topic proportions. The simulation set up for the case with no batch effect is given as follows
K=4;
G=100;
N=100;
alpha_true=matrix(rnorm((K)*G,0.5),nrow=(K)); ### the matrix of fixed effects
library(gtools)
T=2;
omega_true=matrix(rbind(rdirichlet(T*10,c(3,4,2,6)),rdirichlet(T*10,c(1,4,6,3)),
rdirichlet(T*10,c(4,1,2,2)),rdirichlet(T*10,c(2,6,3,2)),
rdirichlet(T*10,c(3,3,5,4))), nrow=N);
H_true=alpha_true+1;
alpha_true=H_true
### generating the table
read_counts=matrix(0,N,G);
for(n in 1:N)
{
for(g in 1:G)
{
mean=exp(omega_true[n,]%*%H_true[,g]+0*rnorm(1,0,1)); ### overdispersion term switched off
read_counts[n,g]=rpois(1,mean);
}
}
We present the Structure plots for the true topic proportion weights, the estimated topic proportion weights from the usual topic model (due to Matt Taddy, as obtained from the function topics of the maptpx library in R), and those from the normal topic model that we have presented so far.
Next we consider the generalized model with a single batch effect. The simulation set up for this case is presented in the following R chunk
K=4;
G=100;
N=100;
alpha_true=matrix(rnorm((K)*G,0.5),nrow=(K)); ### the matrix of fixed effects
Label.Batch=c(rep(1,N/5),rep(2,N/5),rep(3,N/5),rep(4,N/5),rep(5,N/5));
B=max(Label.Batch);
sigmab_true=2;
beta_true=matrix(0,B,G); ### the matrix of the random effect
for(g in 1:G)
{
beta_true[,g]=rnorm(B,mean=0,sd=sigmab_true);
}
library(gtools)
T=2;
omega_true=matrix(rbind(rdirichlet(T*10,c(3,4,2,6)),rdirichlet(T*10,c(1,4,6,3)),
rdirichlet(T*10,c(4,1,2,2)),rdirichlet(T*10,c(2,6,3,2)),
rdirichlet(T*10,c(3,3,5,4))), nrow=N);
H_true=alpha_true+1;
alpha_true=H_true
### generating the table
read_counts=matrix(0,N,G);
for(n in 1:N)
{
for(g in 1:G)
{
mean=exp(omega_true[n,]%*%H_true[,g] +beta_true[Label.Batch[n],g]+0*rnorm(1,0,1)); ### overdispersion term switched off
read_counts[n,g]=rpois(1,mean);
}
}
The corresponding Structure plots of the true topic proportions, the estimated topic model proportions, and those from the normal topic model prescribed in this section are as follows. Here, we fit the normal model in two ways. In one case, we simply factorize the (untransformed) counts matrix \(C\) as \(WH\) and report \(W\); in the other, we fit the LMM (linear mixed model) to the \(log(c_{ng}+\epsilon)\) values.
We consider the same model as before, however we shall not fit a normal topic model, rather we shall try to incorporate the random effect in the Poisson topic model.
We start with some initial values of \(\omega^{(0)}_{nk}\), and given the value of the topic proportion weight \(\omega^{(m)}_{nk}\) at the \(m\)th stage, we fit a LMM model for a fixed \(g\) as follows
\[ c_{n,g} \sim Poi \left ( exp \left (\sum_{k=1}^{K} \alpha_{kg} \omega^{(m)}_{nk} + \beta_{b(n):g} \right ) \right) \hspace{1 cm} \forall n, \hspace{0.2 cm} g \; fixed \]
Then \(\alpha_{kg}\) is the coefficient of the regressor \(\omega^{(m)}_{\star,k}\), so for each \(g\) this is a generalized linear mixed model (GLMM) with \(K\) regressors given by \(\omega^{(m)}_{nk}\) and one random effect. We solve this using the glmer function of the lme4 package in R with family poisson(), and thus obtain estimates of the regression coefficients \(\alpha_{kg}\) and the random effects \(\beta_{b(n):g}\). We call these estimated values \(\alpha^{(m)}_{kg}\) and \(\beta^{(m)}_{bg}\) for \(b=1,2,\cdots,B\), \(k=1,2,\cdots,K\) and \(g=1,2,\cdots,G\). Given these estimates, we can determine \(\omega^{(m+1)}_{nk}\) as follows
\[ v^{(m+1)}_{n,\star} = \underset{v_{n,\star}\; \forall n}{argmin} \sum_{n=1}^{N}\sum_{g=1}^{G} \left [ exp \left (\sum_{k=1}^{K} f_{nk}(v_{n,\star}) \alpha^{(m)}_{kg} + \beta^{(m)}_{b(n):g} \right ) - c_{ng} * \left (\sum_{k=1}^{K} f_{nk}(v_{n,\star}) \alpha^{(m)}_{kg} + \beta^{(m)}_{b(n):g} \right ) \right ] \]
\[ \omega^{(m+1)}_{nk} = f_{nk} (v^{(m+1)}_{n,\star}) \]
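The objective in \(v\) above is the Poisson negative log-likelihood up to a term free of \(v\); a sketch for a single sample \(n\) (function and variable names are ours):

```r
# Negative Poisson log-likelihood (up to a constant in v) for one sample n,
# as a function of the unconstrained parameter v.
eps <- 1e-10
v_to_omega <- function(v) {
  wK <- 1 / (1 + sum(exp(v)))
  c(exp(v) * (wK + eps) - eps, wK)
}

# c_n: length-G counts for sample n; alpha: K x G matrix of fixed effects;
# beta_n: length-G vector of batch intercepts for batch b(n)
neg_loglik_v <- function(v, c_n, alpha, beta_n) {
  eta <- as.vector(v_to_omega(v) %*% alpha) + beta_n
  sum(exp(eta) - c_n * eta)
}

# the v-update is then e.g. optim(v0, neg_loglik_v, c_n = c_n,
#                                 alpha = alpha, beta_n = beta_n)$par
```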
We use the simulation set up same as the one presented in the previous R chunk (with the batch effect included).
The true topic proportion weights, the estimated topic proportion weight obtained from fitting the usual topic model due to Matt Taddy and that due to this mechanism that takes into account the random batch effect, are presented below in the form of Structure Plots.
This fit seems to be really good and very close to the actual topic proportion weights.
So, we have discussed our transition from the usual normal topic model, to the normal topic model with batch effects, to finally the Poisson topic model with batch effects. There are some important remarks I want to make here.
The Poisson model that we discussed above may be of statistical importance, but from a biological point of view, we would like to express the expression level as a mixture of a number of Poisson components under the topic model, which is not satisfied in the previous approach. So we consider a different approach. We first write down the model we are considering here.
\[ c_{ng} | \beta \sim Poi(\lambda_{n}(g)) \]
where \(\lambda_{n}(g)\) is of the form
\[ \lambda_{n}(g) = \sum_{k=1}^{K} \omega_{nk} \nu_{b(n):k}(g) \]
where
\[ log(\nu_{b(n):k}(g)) = \alpha_{kg} + \beta_{b(n):g} \]
Here \(\alpha_{kg}\) represents the fixed effect or the mixing coefficient for the topics or the clusters and the term \(\beta_{b(n):g}\) is the random intercept term for the batch \(b(n)\) and the gene \(g\). We assume that
\[ \beta_{b(n):g} \sim N(0, \sigma^{2}_{g}) \]
Fitting this model is a bit more complicated than the previous approach. I tried some more elaborate methods, such as a Penalized Quasi Likelihood approach, or integrating out the random term \(\beta_{bg}\) and then maximizing the marginal log likelihood over \(\alpha_{kg}\), but these methods did not work, probably because in many of these approaches, instead of the actual log likelihood, what we ended up maximizing was a lower bound to the log likelihood obtained using Jensen's inequality. The method that finally gave some decent answers is presented here.
First we start with some initial estimates of \(\omega^{(0)}_{nk}\) and \(\alpha^{(0)}_{kg}\). For the \(m\)th iteration, we can write
\[ log (\lambda_{n}(g)) = \beta_{b(n):g} + log \left (\sum_{k=1}^{K}\omega^{(m)}_{nk} \alpha^{(m)}_{kg}\right) \]
This reduces to a GLMM (Generalized Linear Mixed Model) with an offset term given by \(log(\sum_{k=1}^{K}\omega^{(m)}_{nk} \alpha^{(m)}_{kg})\) and a random intercept given by \(\beta_{b(n):g}\). However, in R, the glmer function in the lme4 package sometimes fails to converge for this scheme. So, instead, I assumed a normal LMM in which we consider
\[ log(c_{ng}+\epsilon) = log \left (\sum_{k=1}^{K}\omega^{(m)}_{nk} \alpha^{(m)}_{kg} \right ) + \beta_{b(n):g} + e_{ng} \]
where \(log \left ( \sum_{k=1}^{K}\omega^{(m)}_{nk} \alpha^{(m)}_{kg} \right)\) is the offset term and \(\epsilon\) is a small perturbation (here chosen as \(\epsilon = 1e-06\)) to handle the case \(c_{ng}=0\). Using this normal LMM, we obtain the random intercept predictors \(\beta^{(m)}_{b:g}\).
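A sketch of this offset-LMM step for one gene (illustrative names; assumes the lme4 package is installed):

```r
library(lme4)

# For one gene g: predict the batch intercepts beta_{b:g} from an LMM
# whose fixed part is a known offset log(sum_k omega_nk alpha_kg).
# c_g: length-N counts; offset_g: length-N offset; batch: length-N labels.
get_batch_effects <- function(c_g, offset_g, batch, eps = 1e-6) {
  df <- data.frame(y = log(c_g + eps), off = offset_g,
                   batch = factor(batch))
  fit <- lmer(y ~ 0 + offset(off) + (1 | batch), data = df)
  ranef(fit)$batch[, 1]            # predicted beta_{b:g}, one per batch
}
```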
Now given \(\beta^{(m)}_{b:g}\) and \(\omega^{(m)}_{nk}\), we estimate the fixed effect \(\alpha_{kg}\). We now write down the log likelihood in terms of \(\alpha\) as
\[ log L(\alpha;\beta^{(m)},\omega^{(m)}) = \sum_{n=1}^{N} \sum_{g=1}^{G} log \, Poi \left (c_{ng}; \; \sum_{k=1}^{K} \omega^{(m)}_{nk} \, exp(\alpha_{kg} + \beta^{(m)}_{b(n):g}) \right ) \]
We maximize this over each \(\alpha_{kg}\) using the Nelder-Mead unconstrained optimization algorithm and thereby obtain the estimates \(\alpha^{(m+1)}_{kg}\). Given \(\alpha^{(m+1)}_{kg}\) and \(\beta^{(m)}_{bg}\), we obtain the refined estimates \(\omega^{(m+1)}_{nk}\) using the same approach as before.
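The \(\alpha\)-update for one gene might be sketched as follows (function and variable names are ours; we maximize the Poisson log-likelihood of the mixture rate via optim's Nelder-Mead):

```r
# Update alpha_{.,g} for one gene g by maximizing the Poisson log-likelihood
# of the mixture rate lambda_n = sum_k omega_nk * exp(alpha_kg + beta_{b(n):g}).
# alpha0: length-K start; c_g: length-N counts; omega: N x K weights;
# beta_g: length-B batch intercepts; batch: length-N labels.
update_alpha_g <- function(alpha0, c_g, omega, beta_g, batch) {
  negll <- function(a) {
    lambda <- exp(beta_g[batch]) * as.vector(omega %*% exp(a))
    -sum(dpois(c_g, lambda, log = TRUE))
  }
  optim(alpha0, negll, method = "Nelder-Mead")$par
}
```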
Now we present some simulation study results. We considered two cases, one with 2 batches and the other with 5 batches. The simulation design for the 2 (respectively 5) batch case is presented below
K=4;
G=100;
N=500;
alpha_true=matrix(rnorm((K)*G,1,3),nrow=(K)); ### the matrix of fixed effects
Label.Batch=c(rep(1,N/2),rep(2,N/2)); # number of batches = 2
# Label.Batch=c(rep(1,N/5),rep(2,N/5),rep(3,N/5),rep(4,N/5),rep(5,N/5)); # number of batches = 5 (uncomment for the 5-batch case)
B=max(Label.Batch);
sigmab_true=1;
beta_true=matrix(0,B,G); ### the matrix of the random effect
for(g in 1:G)
{
beta_true[,g]=rnorm(B,mean=0,sd=sigmab_true);
}
library(gtools)
T=10;
omega_true=matrix(rbind(rdirichlet(T*10,c(3,4,2,6)),rdirichlet(T*10,c(1,4,6,3)),
rdirichlet(T*10,c(4,1,2,2)),rdirichlet(T*10,c(2,6,3,2)),
rdirichlet(T*10,c(3,3,5,4))), nrow=N);
### generating the table
read_counts=matrix(0,N,G);
for(n in 1:N)
{
for(g in 1:G)
{
mean=omega_true[n,]%*%exp(alpha_true[,g] +beta_true[Label.Batch[n],g]);
read_counts[n,g]=rpois(1,mean);
}
}
The following graphs depict the Structure plots for the true topic proportions, the estimated topic proportions under the usual topic model, and the estimated topic proportions under our model.
The graphs for 5 batches are given as follows
For \(2\) batches, our estimated plot and the actual plot are pretty close; for \(5\) batches, the match is still good to a great extent but not as exact. These fluctuations could be because the data sizes in each batch are not too big.
The Poisson topic model (Ver 1) I considered is quite similar to the one used by Xiang. The difference, I guess, lies in the fact that Xiang's approach is probably fully Bayesian, which is not the case for the function glmer(). It would therefore make sense to compare this approach to Xiang's (after modifying it to include more than one regressor).
The current code is much faster than the previous versions I wrote, and I am hopeful it will work well on big datasets like the Jaitin et al. dataset or the GTEX dataset. Also, since the GLMM or LMM steps are for fixed \(g\) and run over the samples, it should be easier to handle these functions and perhaps parallelize over the \(G\) genes.